{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center> <h1> Naive Bayes</h1></center>\n",
    "\n",
    "# 1. Generative models\n",
    "We can easily transform a supervised classification problem to a unsupervised problem by using the Bayes law.\n",
    "\\begin{align*}\n",
    "p(y=c|\\vec{x},\\theta) &=\\frac{p(\\vec{x}|y=c,\\theta)p(y=c)}{p(\\vec{x}|\\theta)} \\\\\n",
    "                      &=\\frac{p(\\vec{x}|y=c,\\theta)p(y=c|\\theta)}{\\sum_{c'} p(\\vec{x}|y=c',\\theta)p(y=c'|\\theta)}\n",
    "\\end{align*}\n",
    "So the supervised classification problem becomes a density estimation problem.The key is specifying a suitable form for the class-conditional density $p(\\vec{x}|y=c,\\theta)$. Normallly, we can assume the form of the class-conditional density $p(\\vec{x}|y=c,\\theta)$ and estimate the $\\theta$ using the training dataset. The following is about how to assign the form of the class-conditional density and estimate $\\theta$.\n",
    "\n",
    "# 2. Beta-binomial model\n",
    "Suppose $\\vec{x}$ is a scalar variable and $x \\sim Ber(\\theta)$, the training dataset is $D=\\lbrace x_1,x_2,\\ldots,x_N \\rbrace$. Then the task becomes estimating $\\theta$. We shall use the bayes law\n",
    "$$p(\\theta|D)=\\frac{p(\\theta)p(D|\\theta)}{p(D)}$$\n",
    "where $p(D|\\theta)$ is the likelihood function and $p(\\theta)$ is the prior for the parameter and $p(D)$ is just a normalized factor.The likelihood function is easy to get\n",
    "\\begin{align*}\n",
    "p(D|\\theta) &=\\prod_{i=1}^N \\theta^{\\mathbb{1}(x_i=1)} (1-\\theta)^{\\mathbb{1}(x_i=0)} \\\\\n",
    "            &=\\theta^{\\sum_{i=1}^N \\mathbb{1}(x_i=1)} (1-\\theta)^{\\sum_{i=1}^N \\mathbb{1}(x_i=0)}\n",
    "\\end{align*}\n",
    "We need to specify a prior distribution for $\\theta$. The prior distribution of $\\theta$ should be very convenient for calculation. For the most of the time, the form of the prior distribution should be in the same form as likelihood function.In this problem, we use the beta distribution, which support over the interval $[0,1]$ and has the same form as the likelihood function.\n",
    "$$\n",
    "p(\\theta)=Beta(\\theta|a,b) \\propto \\theta ^{a-1} (1-\\theta)^{b-1}\n",
    "$$\n",
    "Then we can calculate the posterior distribution of $\\theta$\n",
    "$$\n",
    "p(\\theta|D) \\propto Beta(\\theta|N_1+a,N_0+b)\n",
    "$$\n",
    "# 3. Dirichlet-multinomial model\n",
    "Suppose $\\vec{x}$ is a scalar variable and $x \\sim Cat(\\vec{\\theta})$, the training dataset is $D=\\lbrace x_1,x_2,\\ldots,x_N \\rbrace$,where $x_i \\in \\lbrace 1,2,\\ldots,K \\rbrace $. Then the task becomes estimating $\\vec{\\theta}$. We shall use the bayes law\n",
    "$$\n",
    "p(\\vec{\\theta}|D)  \\propto p(D|\\vec{\\theta})p(\\vec{\\theta})\n",
    "$$\n",
    "The likelihood function is easy to get\n",
    "\\begin{align*}\n",
    "p(D|\\vec{\\theta}) & =\\prod_{i=1}^N \\prod_{k=1}^K \\theta_k^{\\mathbb{1}(x_i=k)} \\\\\n",
    "                  & =\\prod_{k=1}^K \\theta_k^{\\sum_{i=1}^N \\mathbb{1}(x_i=k)}\n",
    "\\end{align*}\n",
    "We need to specify a prior distribution for $\\vec{\\theta}$ satisfied $\\sum_{k=1}^K \\theta_k=1$. The form of the prior distribution should be in the same form as the likelihood function.In this problem, we choose the Dirichlet distribution\n",
    "$$\n",
    "Dir(\\vec{\\theta}|\\vec{\\alpha})= \\frac{1}{B(\\vec{\\alpha})}\\prod_{k=1}^K \\theta_k^{\\alpha_k-1}\n",
    "$$\n",
    "where $\\vec{\\alpha}$ is the hyperparameter.\n",
    "The posterior is also a Dirichlet distribution\n",
    "$$\n",
    "p(\\vec{\\theta}|D)=Dir(\\vec{\\theta}| \\alpha_1+N_1,\\ldots,\\alpha_K+N_K)\n",
    "$$\n",
    "# 4. Naive Bayes classifiers\n",
    "In the above, Beta-binomial model and Dirichlet-multinomial model are for scalar $x$. When $\\vec{x}$ is a multi-dimensional vector. We need to specify a multivariate class conditional distribution $p(\\vec{x}|y=c,\\theta)$ which is more difficult. The **Naive Bayes classifiers**  assume that the features are conditional independent given the class label,which means\n",
    "$$p(\\vec{x}|y=c,\\theta)=\\prod_{i=1}^d p(x_i|y=c,\\theta)$$\n",
    "which transform the multivariate problem to several univariate problem.\n",
    "The form of the class-conditional density depends on the type of each feature.\n",
    "\t\n",
    "* binary features, $x_i \\in \\lbrace0,1\\rbrace$, we can use the Bernoulli distribution:$p(x_i|y=c,D)=Ber(x_i|\\theta_{ic})$\n",
    "* categorical features, $x_i \\in \\lbrace 1, \\ldots, K \\rbrace$ , we can use the multinoulli distribution:$p(x_i|y=c,D)=Cat(x_i|\\overrightarrow{\\theta_{ic}})$\n",
    "* real-valued features, we can use the Gaussian distribution : $p(x_i|y=c,D)=N(x_i|\\mu_{ic},\\sigma_{ic}^2)$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(150, 4)\n",
      "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n",
      "['setosa', 'versicolor', 'virginica']\n",
      "Number of mslabeled points out of a total 30 points:2\n"
     ]
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "iris=datasets.load_iris()\n",
    "print(iris.data.shape)\n",
    "print(list(iris.feature_names))\n",
    "print(list(iris.target_names))\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "gnb=GaussianNB()\n",
    "from sklearn.model_selection import train_test_split\n",
    "x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.2)\n",
    "y_pred=gnb.fit(x_train,y_train).predict(x_test)\n",
    "print(\"Number of mslabeled points out of a total %d points:%d\" %(x_test.shape[0],(y_pred!=y_test).sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Classifying document using bag of words\n",
    "**Document classification** is the problem of classifying text documents into different categories. One simple approach is to represent each document as a binary vector, which records whether each word is present or not, so $x_{ij}=1$ iff word $j$ occurs in document $i$. We can then use the Bernoulli distribution  for each word and each document categories.\n",
    "$$p(x_j|y=c,\\theta)=Ber(x_j|\\theta_{jc})$$\n",
    "This is very simple to use. However, ignoring the number of times each word occurs in a document loses some information.A more accurate representation counts the number of occurrences of each word. Specifically, let $\\vec{x_i}$ be a vector of counts for document i, so $x_{ij} \\in \\lbrace 0,1,\\ldots,n_i \\rbrace$, where $n_i$ is the number of terms in document i (so $\\sum_{i=1}^d x_{ij}=n_i$ ). For\n",
    "the class conditional densities, we can use a multinomial distribution\n",
    "$$\n",
    "p(\\vec{x}|y=c)=Mu(\\vec{x}|n,\\vec{\\theta_c})=\\frac{n!}{x_1! x_2!\\ldots x_d!} \\prod_{j=1}^d \\theta_{cj}^{x_j} \\\\\n",
    "n=\\sum_{j=1}^d x_j\n",
    "$$\n",
    "The parameters $\\theta_{cj}$ is estimated using Maximum A Posterior. The training dataset was D,which can be seperated into C subsets $T_c(c=1\\ldots C)$\n",
    "$$\n",
    "p(\\vec{\\theta_c}|D)=p(\\vec{\\theta_c})p(T_c|\\vec{\\theta_c})\n",
    "$$\n",
    "The likelihood function was \n",
    "$$\n",
    "p(T_c|\\vec{\\theta_c}) \\propto \\prod_{\\vec{x}\\in T_c} \\prod_{j=1}^d \\theta_{cj}^{x_j}=\n",
    "\\prod_{j=1}^d \\theta_{cj}^{\\sum_{\\vec{x}\\in T_c} x_j}\n",
    "$$\n",
    "We can choose the prior distributin of $\\theta_{cj}$ as Dirichlet distribution, which constrains that $\\sum_j \\theta_{cj}=1$\n",
    "$$\n",
    "p(\\vec{\\theta_c})=Dir(\\vec{\\theta_c}|\\alpha) \\propto \\prod_{j=1}^d \\theta_{cj}^{\\alpha_j-1}\n",
    "$$\n",
    "So we can get the posterior distribution of $\\theta_{cj}$ as following\n",
    "$$\n",
    "p(\\vec{\\theta_c}|D) \\propto \\prod_{j=1}^d \\theta_{cj}^{\\alpha_j-1+\\sum_{\\vec{x}\\in T_c} x_j} \\\\\n",
    "subject \\, to \\sum_j \\theta_{cj}=1\n",
    "$$\n",
    "By using the lagrange multiple method, we can solve the MAP problem.The MAP estimate of $\\theta_{cj}$ is\n",
    "$$\n",
    "\\hat{\\theta_{cj}}=\\frac{\\alpha_j-1+\\sum_{\\vec{x}\\in T_c} x_j}{\\sum_{j=1}^d\\left[\\alpha_j-1+\\sum_{\\vec{x}\\in T_c} x_j\\right]}\n",
    "$$\n",
    "The explaination of Multinomial Naive Bayes of sklearn is awasome."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(11314,)\n",
      "['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "train_dataset=fetch_20newsgroups(subset='train',remove=('headers','footers','quotes'))\n",
    "test_dataset=fetch_20newsgroups(subset='test',remove=('headers','footers','quotes'))\n",
    "print(train_dataset.filenames.shape)\n",
    "print(train_dataset.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(11314, 101631)\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vectorizer=TfidfVectorizer()\n",
    "words_train=vectorizer.fit_transform(train_dataset.data)\n",
    "words_test=vectorizer.transform(test_dataset.data)\n",
    "print(words_train.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.682861129525057"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import metrics\n",
    "clf=MultinomialNB(alpha=.01)\n",
    "clf.fit(words_train,train_dataset.target)\n",
    "pred=clf.predict(words_test)\n",
    "metrics.f1_score(pred,test_dataset.target,average='macro')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(20, 101631)\n",
      "1.0000000000000007\n"
     ]
    }
   ],
   "source": [
    "print(clf.coef_.shape)\n",
    "import numpy as np\n",
    "print(np.sum(np.exp(clf.coef_[0])))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
